This dataset entails data from 2008-2022 on the real Residential Property Prices for Portugal. Data was downloaded from: https://fred.stlouisfed.org/series/QPTR628BIS
Reference: Bank for International Settlements, Real Residential Property Prices for Portugal [QPTR628BIS], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/QPTR628BIS, June 15, 2023.
'data.frame': 60 obs. of 2 variables:
$ DATE : chr "1/1/2008" "4/1/2008" "7/1/2008" "10/1/2008" ...
$ QPTR628BIS: num 103.4 101.9 99.4 98.2 100 ...
house_pricesbck <- house_prices #backup
Load packages
library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("ggplot2")library("plotly")
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library("lubridate")library("stats")
We have two variables:
DATE: character; date of recorded average price value of houses in Portugal
QPTR628BIS: character; recorded average price value of houses in Portugal. However, this variable has its values separated with the thousand separator. This is ok, we must however keep in mind that the values are expressed in thousands.
We must now make date be read as a date variable in order to accuratly evaluate how the prices have fluctuated over time.
Min. 1st Qu. Median Mean 3rd Qu. Max.
80.04 87.44 99.85 104.47 116.81 147.70
Minimum value for a residential property in Portugal from 2008-2022 was around 80K euros. Mean value is 104K euros. Maximum value registered was of 147.7K euros.
Let’s evaluate the values dispersion through a boxplot
For better visualization purposes we will use plotly.
fig <-plot_ly(house_prices, x =~price, color =~as.factor(year), type ="box")fig
Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors
We can see that the prices have been increasing since 2017, with 2021 being the year where values most fluctuated throughout the year. It seems that there are differences between the pre COVID years (<2020 & >2016 ) and the post COVID years (>= 2020) in terms of prices. However, let’s prove this theory through statistical testing.
Let’s divide our data into pre and post covid dates.
prices_pre <- house_prices %>%filter(year <2020& year >2016)unique(prices_pre$year)
Our results are clearly compatible with the histograms above, which is expected.
Test normality
The null hypothesis of these tests is that “sample distribution is normal”. If the test is significant, the distribution is non-normal.
Since we have quite a small sample size for the post-covid timeframe we have to be aware that normality tests are sensitive to sample size. Small samples most often pass normality tests which is why we combined visual inspection and significance test in order to take the right decision.
shapiro.test(prices_pre$price)
Shapiro-Wilk normality test
data: prices_pre$price
W = 0.95219, p-value = 0.6692
From the output, the p-value > 0.05 implying that the distribution of the data are not significantly different from normal distribution. In other words, we can assume normality.
shapiro.test(prices_pro$price)
Shapiro-Wilk normality test
data: prices_pro$price
W = 0.87883, p-value = 0.08468
From the output, the p-value > 0.05 implying that the distribution of the data are not significantly different from normal distribution. In other words, we can assume normality.
However, this seems to be a case influenced by the small sample size which affects the sensitivity of the test. When we check with the plots above we can tell that the data is not normally distributed.
Compare price data before and after COVID
In order to choose the correct test to use we must take into consideration that: - Our data is not normally distributed - Our samples are not independent (they are related as they were measured over time)
When we have paired or related samples, we must consider using the Wilcoxon signed-rank test, which is appropriate when comparing two related groups with non-normally distributed data.
res <-wilcox.test(prices_pre$price, prices_pro$price, paired =TRUE)res
Wilcoxon signed rank exact test
data: prices_pre$price and prices_pro$price
V = 0, p-value = 0.0004883
alternative hypothesis: true location shift is not equal to 0
The p-value of the test is 0.0004883, which is less than the significance level alpha = 0.05. We can conclude that the median price of the houses in Portugal before COVID is significantly different from the median prices after COVID with a p-value = 0.0004883.